For HW3 I am not outputting any of the cleaning however I am leaving it in for reference.
The intent of this homework for me was to show how playing around with the descriptive statistics is a great idea for exploring your data. The questions I originally had intended to answer with the graphs lead to new questions and at the end of this has added a new time period for my final project to investigate.
Had I simply ommitted the outline I found in the data or changed my groupings to hide it, I would not have thought to highlight this time in history.
Data
Briefly describe the data
Below is a list of the available sheets in the SIPRI military spending data export. A few of the tabs are the same base information altered to reflect a specific currency or ratio. I am choosing to use the “Current USD” as my source of raw spending numbers as I believe it to be the easiest to understand and relate to. I will also be using “share of GDP” as well as “Share of Govt. spending” to provide context around a nations spending compared to their population they intend on defending as well as compared to overall spending.
Description of data I will use
“Gross domestic product (GDP) is the total monetary or market value of all the finished goods and services produced within a country’s borders in a specific time period.” - investopedia
The Military spending information I am analyzing comes from SIPRI, the Stockholm International Peace Research Institute. SIPRI was established in 1966 for the purpose of “research into conflict, armaments, arms control and disarmament”. The organization is funded by the Swedish government however it regularly works internationally with research centers around the globe. SIPRI collected this particular data set from the official reports of each of the included governments in the form of official publications such as public budgets.
Here is a peek at what the raw information looks like.
Next I delete the first row of NA as well as the Notes column for each tibble.
Code
## trying Purr here to be cleaner, we have not covered this yet so please let me know if this could be better.#tibleList <- lst(rawData_CurrentUSD, rawData_ShareOfGDP, rawData_ShareOfGovSpend)# could pur this with my func i bet## not working ^^rawData_CurrentUSD <- rawData_CurrentUSD[-1,] %>%select(-Notes)rawData_ShareOfGDP <- rawData_ShareOfGDP[-1,] %>%select(-Notes)rawData_ShareOfGovSpend <- rawData_ShareOfGovSpend[-1,] %>%select(-2:-3)#head(rawData_CurrentUSD)#head(rawData_ShareOfGDP)#head(rawData_ShareOfGovSpend)
Tidying Data
Before pivoting my data I wanted to add a column for region. I chose to do this with an algorithm as an exercise in iteration however it would have been more practical to just hard code this column.
Before Pivoting I am adding a column for region
The list of countries where there are NA’s for every year:
Code
Regions <- rawData_CurrentUSD %>%filter(is.na(`1949`))%>%select(Country)#Regions
You will notice that some of these are continents and sub-continents, I wanted to break these into a Region field. The pattern I saw and chose to exploit was the fact that the overarching category would always be followed with a more specific category such as the Africa followed by North Africa values (see earlier print outs to see raw information).
I then added an empty vector into the tibble and populated it according to the Country column using the pattern described above.
Once I had left flags in the region column dictating where each region was starting I could use fill(Region) in order to populate the rest of the column.
Code
emptyRegionCol <-as.numeric(vector(mode ="character",length =length(rawData_CurrentUSD$`1949`)))## doing as numeric here as a hacky way to fill this with NA's, any better way?#adding col to tibble to be populated latermutated_CurrentUSD <- rawData_CurrentUSD %>%mutate(Region =emptyRegionCol,.before=2)#mutated_CurrentUSD#head(mutated_CurrentUSD)# replacing the empty char in region with actual region if 2 NAs appear in a row#iterates along the indices of the Country vectorfor(i inseq_along(mutated_CurrentUSD$Country) ){#if 2 nas in a row then do somethingif(is.na(mutated_CurrentUSD$`1949`[i]) &&is.na(mutated_CurrentUSD$`1949`[i+1]) ){ mutated_CurrentUSD$Region[i+1] <- mutated_CurrentUSD$Country[i+1] #this works because when referencing an index outside of the vector R will return NA - Great to know. }#if the row is a regionif(is.na(mutated_CurrentUSD$`1949`[i]) ){ mutated_CurrentUSD$Region[i+1] <- mutated_CurrentUSD$Country[i] }}mutated_CurrentUSD <- mutated_CurrentUSD %>%fill(Region)#mutated_CurrentUSD
Now that I had my proof of concept for cleaning one of the sheets, in a format I wanted, it was time to clean the other sheets as well. I could have taken the above algorithm and modified it to work for each of the other tabs individually however I took the opportunity to practice with functions and created the below function to clean any of the tabs once they were loaded in and lightly cleaned.
Clean and format function
Code
## creating a function to clean the other sheets##Input: a partially cleaned tibble, must have been read in and had excess rows trimmed##output: tibble with regions correctly labeled and umbrella region categories i.e Africa above south Africa removed##Note: All sub categories contained the main category name in them, will use regex to create groupings in the future.CleanData <-function(inSheet){ emptyRegionCol <-as.numeric(vector(mode ="character",length =nrow(select(inSheet,2))))#adding col to tibble to be populated later output <- inSheet %>%mutate(Region = emptyRegionCol,.before=2)# replacing the empty char in region with actual region if 2 NAs appear in a row#iterates along the indices of the Country vectorfor(i inseq_along(output$Country)){if(is.na(output[[i,3]]) &&is.na(output[[(i+1),3]])) output$Region[i+1] <- output$Country[i+1]if(is.na(output[[i,3]])) output$Region[i+1] <- output$Country[i] } output <- output %>%fill(Region)# removing header rows output <- output %>%filter(!is.na(output[[3]]))}
Below I call the function on each of the tibbles loaded in earlier.
Next I pivot the years, it should be noted that both the pivoting and Converting of notation could be handled by the clean function above, i kept the steps separate so that I could show the progression of the dataframe.
I must also handle the xxx and … notations. From the information page of my data set I can see that the notation is described as follows:
Raw Notation
Meaning
…
Data unavailable
xxx
Country did not exist or was not independent during all or part of the year in question
For now I think I will just keep both as NA but will save this form of the information for future use.
Finally I will convert the column types to their correct representation.
Before:
Code
#head(Cleaned_currentUSD)#head(Cleaned_ShareOfGovSpend)#head(Cleaned_ShareOfGDP)# Year to date#Cleaned_currentUSD$Year<-as_date(Cleaned_currentUSD$Year, format = "%Y",tz=NULL)#Cleaned_ShareOfGovSpend$Year<-as_date(Cleaned_ShareOfGovSpend$Year, format = "%Y",tz=NULL)#Cleaned_ShareOfGDP$Year<-as_date(Cleaned_ShareOfGDP$Year, format = "%Y",tz=NULL)#Year to NumberCleaned_currentUSD$Year<-as.integer(Cleaned_currentUSD$Year)Cleaned_ShareOfGovSpend$Year<-as.integer(Cleaned_ShareOfGovSpend$Year)Cleaned_ShareOfGDP$Year<-as.integer(Cleaned_ShareOfGDP$Year)#Cleaned_currentUSD# value to doubleCleaned_currentUSD$value <-as.numeric(Cleaned_currentUSD$value)Cleaned_ShareOfGovSpend$value<-as.numeric(Cleaned_ShareOfGovSpend$value)Cleaned_ShareOfGDP$value<-as.numeric(Cleaned_ShareOfGDP$value)
mean median and std for each of the tabs Charts for the stats
First I wanted to get an idea of what my data looks like for each of the countries individually. In order to see this I grouped on country name and the types of data we have.
In order to get that done I need to pivot my data again.
# A tibble: 6 × 5
Country Region Year viewOfSpend Spend
<chr> <chr> <int> <chr> <dbl>
1 Algeria North Africa 1949 MilitarySpend_currentUSD NA
2 Algeria North Africa 1949 MilitarySpend_ShareofGovSpend NA
3 Algeria North Africa 1949 MilitarySpend_ShareofGDP NA
4 Algeria North Africa 1950 MilitarySpend_currentUSD NA
5 Algeria North Africa 1950 MilitarySpend_ShareofGovSpend NA
6 Algeria North Africa 1950 MilitarySpend_ShareofGDP NA
Code
#view(combinedData)na.omit(combinedData)
# A tibble: 20,270 × 5
Country Region Year viewOfSpend Spend
<chr> <chr> <int> <chr> <dbl>
1 Algeria North Africa 1963 MilitarySpend_currentUSD 66.4
2 Algeria North Africa 1963 MilitarySpend_ShareofGDP 0.0251
3 Algeria North Africa 1964 MilitarySpend_currentUSD 99.9
4 Algeria North Africa 1964 MilitarySpend_ShareofGDP 0.0350
5 Algeria North Africa 1965 MilitarySpend_currentUSD 106.
6 Algeria North Africa 1965 MilitarySpend_ShareofGDP 0.0342
7 Algeria North Africa 1966 MilitarySpend_currentUSD 105.
8 Algeria North Africa 1966 MilitarySpend_ShareofGDP 0.0351
9 Algeria North Africa 1967 MilitarySpend_currentUSD 99.2
10 Algeria North Africa 1967 MilitarySpend_ShareofGDP 0.0302
# … with 20,260 more rows
Below is a graph depicting the completeness of the information available to us.
# A tibble: 14 × 2
Region numberCountries
<chr> <int>
1 sub-Saharan Africa 47
2 Central Europe 20
3 Western Europe 20
4 Middle East 16
5 Central America and the Caribbean 13
6 South America 11
7 South East Asia 11
8 Eastern Europe 8
9 East Asia 6
10 South Asia 6
11 Central Asia 5
12 North Africa 4
13 Oceania 4
14 North America 2
Due to the number of Regions I may need to focus on some in particular…
# A tibble: 14 × 6
# Groups: Region [14]
Region viewOfSpend mean std min max
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 North America MilitarySpend_… 1.49e5 2.23e5 426. 8.01e5
2 East Asia MilitarySpend_… 2.04e4 4.21e4 0 2.93e5
3 Eastern Europe MilitarySpend_… 1.09e4 3.52e4 0 2.46e5
4 Western Europe MilitarySpend_… 7.14e3 1.28e4 0 7.34e4
5 Middle East MilitarySpend_… 4.91e3 9.59e3 0 8.72e4
6 South Asia MilitarySpend_… 4.46e3 1.17e4 2.48 7.66e4
7 Oceania MilitarySpend_… 2.83e3 6.29e3 0 3.18e4
8 South America MilitarySpend_… 2.02e3 4.57e3 0 3.69e4
9 South East Asia MilitarySpend_… 1.94e3 3.16e3 0 3.30e4
10 North Africa MilitarySpend_… 1.36e3 2.21e3 3.71 1.04e4
11 Central Europe MilitarySpend_… 1.14e3 2.20e3 0 1.74e4
12 Central Asia MilitarySpend_… 3.36e2 5.92e2 0 2.55e3
13 Central America and the Caribbean MilitarySpend_… 2.86e2 9.54e2 0 8.68e3
14 sub-Saharan Africa MilitarySpend_… 2.58e2 6.35e2 0 6.85e3
# removing NAs here because I cant plot them, I dont want to replace them with other values because it is still of interest when countries do not display thier data.
Something crazy going on with the middle east spend…
# A tibble: 6 × 5
# Groups: Region, viewOfSpend, Year [6]
Country Region Year viewOfSpend Spend
<chr> <fct> <int> <chr> <dbl>
1 Kuwait Middle East 1991 MilitarySpend_ShareofGDP 1.17
2 Kuwait Middle East 1990 MilitarySpend_ShareofGDP 0.485
3 Kuwait Middle East 1992 MilitarySpend_ShareofGDP 0.318
4 Israel Middle East 1975 MilitarySpend_ShareofGDP 0.305
5 Israel Middle East 1976 MilitarySpend_ShareofGDP 0.292
6 Israel Middle East 1973 MilitarySpend_ShareofGDP 0.279
I initially thought this was a typo since there were two 1s in a row and it was such an insanely high spend. A share of GDP of 1.173 means Kuwait spent 117.3% of their GDP in that year. Upon further investigation, this is actually true and it was driven by the Persian Gulf War(aka Gulf War). The Gulf War was a conflict triggered by Iraqs invasion of Kuwait in 1990. This invasion was the first major international crisis since the Cold War and will certainly be a topic I add to the final analysis.
https://www.britannica.com/event/Persian-Gulf-War
Loosely I will plot Iraq, US and Kuwait spend from 1985-1995 to explore this time period.
Code
combinedData%>%filter(Year >=1985& Year<=1995)%>%filter(Country %in%c("United States of America","Iraq","Kuwait"))%>%ggplot(mapping=aes(y = Spend, x = Year))+geom_point(aes(color = Country, alpha=0.9),na.rm =FALSE)+facet_wrap(vars(`viewOfSpend`),scales ="free_y")
A new discovery! Iraq chose not to report their spending during the Gulf war.
# A tibble: 87 × 5
Country Region Year viewOfSpend Spend
<chr> <chr> <int> <chr> <dbl>
1 Iraq Middle East 1961 MilitarySpend_ShareofGovSpend NA
2 Iraq Middle East 1962 MilitarySpend_ShareofGovSpend NA
3 Iraq Middle East 1963 MilitarySpend_ShareofGovSpend NA
4 Iraq Middle East 1964 MilitarySpend_ShareofGovSpend NA
5 Iraq Middle East 1965 MilitarySpend_ShareofGovSpend NA
6 Iraq Middle East 1966 MilitarySpend_ShareofGovSpend NA
7 Iraq Middle East 1967 MilitarySpend_ShareofGovSpend NA
8 Iraq Middle East 1968 MilitarySpend_ShareofGovSpend NA
9 Iraq Middle East 1969 MilitarySpend_ShareofGovSpend NA
10 Iraq Middle East 1970 MilitarySpend_ShareofGovSpend NA
# … with 77 more rows
# A tibble: 14 × 6
# Groups: Region [14]
Region viewOfSpend mean std min max
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Middle East MilitarySpend… 0.161 0.0757 1.83e-2 0.575
2 South Asia MilitarySpend… 0.122 0.0511 3.65e-2 0.274
3 Eastern Europe MilitarySpend… 0.115 0.0887 7.71e-3 0.350
4 sub-Saharan Africa MilitarySpend… 0.103 0.260 3.90e-3 5.82
5 South East Asia MilitarySpend… 0.0984 0.0749 7.83e-3 0.392
6 North Africa MilitarySpend… 0.0973 0.0369 2.81e-2 0.178
7 East Asia MilitarySpend… 0.0837 0.0503 1.73e-2 0.225
8 South America MilitarySpend… 0.0728 0.0392 1.73e-2 0.348
9 North America MilitarySpend… 0.0600 0.0386 2.51e-2 0.124
10 Central Asia MilitarySpend… 0.0518 0.0269 1.36e-2 0.158
11 Oceania MilitarySpend… 0.0441 0.0155 1.54e-2 0.0879
12 Central Europe MilitarySpend… 0.0439 0.0283 6.72e-4 0.290
13 Central America and the Caribbean MilitarySpend… 0.0354 0.0218 2.20e-3 0.214
14 Western Europe MilitarySpend… 0.0348 0.0189 8.58e-3 0.164
Code
library(forcats)statsViz2<- byRegionStats%>%mutate(test =fct_reorder(Region,desc(mean)))%>%ggplot(mapping=aes(y = test, x = mean))+geom_point(aes(color = Region, alpha=0.9, size=max)) +guides(alpha="none",color="none") +labs(title ="Certain Regions on average spend more than others" ,caption ="Data from SPIRI Military Expediture Database")
Error in mutate(., test = fct_reorder(Region, desc(mean))): object 'byRegionStats' not found
Code
statsViz2
Error in eval(expr, envir, enclos): object 'statsViz2' not found
From this we can see the top 6 spenders reside in a cluster of there own when you look at their mean and max spend.
# A tibble: 1,022 × 4
# Groups: Region, viewOfSpend [14]
Region viewOfSpend Year mean
<chr> <chr> <int> <dbl>
1 North America MilitarySpend_currentUSD 2021 413561.
2 North America MilitarySpend_currentUSD 2020 400839.
3 North America MilitarySpend_currentUSD 2011 386841.
4 North America MilitarySpend_currentUSD 2010 378660.
5 North America MilitarySpend_currentUSD 2019 378369.
6 North America MilitarySpend_currentUSD 2012 372829.
7 North America MilitarySpend_currentUSD 2009 362427.
8 North America MilitarySpend_currentUSD 2018 352610.
9 North America MilitarySpend_currentUSD 2013 348872.
10 North America MilitarySpend_currentUSD 2008 338049.
# … with 1,012 more rows
Code
byRegionplot<- byRegionStats_year%>%ggplot(mapping=aes(x = Year, y = mean))+geom_point(aes(color = Region, alpha=0.9,)) +theme_light() +guides(alpha="none") +labs(title ="East asia and NA dominate raw spend" ,caption ="Data from SPIRI Military Expediture Database")byRegionplot
To be a little less noisy, showing the top 6 spenders over time
Error in `filter()`:
! Problem while computing `..1 = Region == top6_mean`.
ℹ The error occurred in group 1: Region = "Central America and the Caribbean",
viewOfSpend = "MilitarySpend_currentUSD".
Caused by error in `mask$eval_all_filter()`:
! object 'top6_mean' not found
Code
byRegionplot_top6<- byRegionStats_year_top6%>%ggplot(mapping=aes(x = Year, y = mean))+geom_point(aes(color = Region, alpha=0.9,)) +facet_wrap(vars(`Region`),scales ="free_y") +labs(title ="Raw Mean Spend in USD for top 6 Spenders" ,caption ="Data from SPIRI Military Expediture Database")
Error in ggplot(., mapping = aes(x = Year, y = mean)): object 'byRegionStats_year_top6' not found
Code
byRegionplot_top6
Error in eval(expr, envir, enclos): object 'byRegionplot_top6' not found
From here I choose to investigate Eastern Europe.
Code
byRegionStats_year_EasternEurope<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%filter(Region =='Eastern Europe')EasternEuropePlot<- byRegionStats_year_EasternEurope%>%ggplot(mapping=aes(x = Year, y = Spend))+geom_point(aes(color = Country, alpha=0.9)) +facet_wrap(vars(`Country`)) +labs(title ="Raw Mean Spend in Eastern Europe" ,caption ="Data from SPIRI Military Expediture Database")EasternEuropePlot
I left the scaling fixed so that you could see how high USSR spending was and then they become Russia and spend slowley increases but not to the levels of the USSR.
We see here that USSR spending looks to be about late 1990s, following some research I found…
Code
startOfNoData<-1985endofNoData<-1995USSRSpendPlot<-combinedData%>%filter(Country %in%c("USSR","Russia"))%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%ggplot(mapping=aes( x = Year,y = Spend,color =`Country`))+geom_point()USSRSpendPlot+geom_vline(xintercept =1987,color="red")+annotate("text",x=1987,y=0,label="1987",hjust=1,vjust=-0,color="red")+geom_vline(xintercept =1990,color="red")+annotate("text",x=1990,y=0,label="1990",hjust=-0,vjust=-0,color="red")+geom_vline(xintercept =1945,color="black")+annotate("text",x=1945,y=0,label="Start of Cold War, Defeat of Germany and Japan",hjust=-0,vjust=-0,color="black",angle='90')+geom_vline(xintercept =1985,color="black")+annotate("text",x=1985,y=0,label="Mikhail Gorbachev becomes leader of the USSR",hjust=-0.1,vjust=-0,color="black",angle='90')+geom_vline(xintercept =1991,color="black")+annotate("text",x=1991,y=0,label="1991 End of Cold War, USSR disbanded",hjust=-0.1,vjust=1,color="black",angle='90')
From this view of the information it leads one to believe the USSR spending was very high during the cold war, I am unsure why we were given access to their spending towards the end of the cold war, it could have been a show of good faith, more research to come. Once the war was declared over, Russian spending fell to 0 in the following year and slowly started creeping back up.
Another area of research i get from this would be what happened in 2003- 2004 to spike the trend in Russian spending.
HW Questions
what questions are left unanswered?
Russian spending during 2003-2004
The rest of my outlined possible questions from HW2 will still be answered, their answer was just not obvious from the summary statistics.
what may be unclear?
In order to improve some of these charts I could include some information about global spending at the time or perhaps augment the information to display it as a portion of global military spend? - If i showed percent of global(or engaged parties) military spend it would be interesting to show if just spending more would win the wars.
How could I improve these vizualizations
I could change the view of some of these visualizations so that they are more appealing. I liked the use of the violin plot for the Gulf War analysis and will use more variety in graphs for the final.
Conclusion
“Every gun that is made, every warship launched, every rocket fired signifies in the final sense, a theft from those who hunger and are not fed, those who are cold and are not clothed. > > This world in arms is not spending money alone. It is spending the sweat of its laborers, the genius of its scientists, the hopes of its children. This is not a way of life at all in any > true sense. Under the clouds of war, it is humanity hanging on a cross of iron.”
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. OReilly Media.
https://www.investopedia.com/terms/g/gdp.asp
Appendix: Full descriptions as provided by SIPRI:
Introduction
Estimates of world, regional and sub-regional totals in constant (2019) US$ (billions).
Data for military expenditure by country in current price local currency, presented according to each country’s financial year.
Data for military expenditure by country in current price local currency, presented according to calendar year.
Data for military expenditure by country in constant price (2019) US$ (millions), presented according to calendar year, and in current US$m. for 2020.
Data for military expenditure by country in current US$ (millions), presented according to calendar year.
Data for military expenditure by country as a share of GDP, presented according to calendar year.
Data for military expenditure per capita, in current US$, presented according to calender year. (1988-2020 only)
Data for military expenditure as a percentage of general government expenditure. (1988-2020 only)
Source Code
---title: "Global Military Spending - An Analysis"description: "Hw_3"author: "Julian Castoro"date: "`r Sys.Date()`"format: html: toc: true code-fold: true code-copy: true code-tools: trueoutput: distill::distill_articlecategories: - Homework3 - Julian Castoro---```{r}#| label: setup#| warning: false#| message: falselibrary(tidyverse)library(lubridate)library(ggplot2)library(gganimate)library(readxl)library(dplyr)library(purrr)library(lubridate)library(leaflet)options(digits =3,decimals=2)options(scipen =999)knitr::opts_chunk$set(echo =TRUE, warning=FALSE, message=FALSE)```## IntroductionFor HW3 I am not outputting any of the cleaning however I am leaving it in for reference.The intent of this homework for me was to show how playing around with the descriptive statistics is a great idea for exploring your data. The questions I originally had intended to answer with the graphs lead to new questions and at the end of this has added a new time period for my final project to investigate.Had I simply ommitted the outline I found in the data or changed my groupings to hide it, I would not have thought to highlight this time in history.## Data### Briefly describe the dataBelow is a list of the available sheets in the SIPRI military spending data export. A few of the tabs are the same base information altered to reflect a specific currency or ratio. I am choosing to use the "Current USD" as my source of raw spending numbers as I believe it to be the easiest to understand and relate to. I will also be using "share of GDP" as well as "Share of Govt. spending" to provide context around a nations spending compared to their population they intend on defending as well as compared to overall spending. ### Description of data I will use "Gross domestic product (GDP) is the total monetary or market value of all the finished goods and services produced within a country’s borders in a specific time period." - investopedia ```{r}sheets <-excel_sheets("_data/SIPRI-Milex-data-1949-2021.xlsx")#sheets```### Where did this information come from?The Military spending information I am analyzing comes from **SIPRI**, the **S**tockholm **I**nternational **P**eace **R**esearch **I**nstitute. SIPRI was established in 1966 for the purpose of "research into conflict, armaments, arms control and disarmament". The organization is funded by the Swedish government however it regularly works internationally with research centers around the globe. SIPRI collected this particular data set from the official reports of each of the included governments in the form of official publications such as public budgets.Here is a peek at what the raw information looks like.```{r}#| warning: false#| message: falserawrawData_ShareOfGovSpend <-read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx",sheet=sheets[9])#head(rawrawData_ShareOfGovSpend)```### Reading in Raw informationReading in raw informationThe data provided by the Stockholm International Peace Research Institute(sipri) was separated into multiple tabs in one excel .xlsx file. In order to begin working I imported the tabs I planned to utilize, skipping over some of the notes and title rows at the start of each tab.```{r}rawData_CurrentUSD <-read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx",sheet=sheets[6],skip=5)rawData_ShareOfGDP <-read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx",sheet=sheets[7],skip=5)rawData_ShareOfGovSpend <-read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx",sheet=sheets[9],skip=7)#head(rawData_CurrentUSD)#head(rawData_ShareOfGDP)#head(rawData_ShareOfGovSpend)```Next I delete the first row of NA as well as the Notes column for each tibble.```{r}## trying Purr here to be cleaner, we have not covered this yet so please let me know if this could be better.#tibleList <- lst(rawData_CurrentUSD, rawData_ShareOfGDP, rawData_ShareOfGovSpend)# could pur this with my func i bet## not working ^^rawData_CurrentUSD <- rawData_CurrentUSD[-1,] %>%select(-Notes)rawData_ShareOfGDP <- rawData_ShareOfGDP[-1,] %>%select(-Notes)rawData_ShareOfGovSpend <- rawData_ShareOfGovSpend[-1,] %>%select(-2:-3)#head(rawData_CurrentUSD)#head(rawData_ShareOfGDP)#head(rawData_ShareOfGovSpend) ```### Tidying DataBefore pivoting my data I wanted to add a column for region. I chose to do this with an algorithm as an exercise in iteration however it would have been more practical to just hard code this column.#### Before Pivoting I am adding a column for regionThe list of countries where there are NA's for every year:```{r}Regions <- rawData_CurrentUSD %>%filter(is.na(`1949`))%>%select(Country)#Regions```You will notice that some of these are continents and sub-continents, I wanted to break these into a `Region` field. The pattern I saw and chose to exploit was the fact that the overarching category would always be followed with a more specific category such as the **Africa** followed by **North Africa** values (see earlier print outs to see raw information). I then added an empty vector into the tibble and populated it according to the Country column using the pattern described above.Once I had left flags in the region column dictating where each region was starting I could use `fill(Region)` in order to populate the rest of the column.```{r}emptyRegionCol <-as.numeric(vector(mode ="character",length =length(rawData_CurrentUSD$`1949`)))## doing as numeric here as a hacky way to fill this with NA's, any better way?#adding col to tibble to be populated latermutated_CurrentUSD <- rawData_CurrentUSD %>%mutate(Region =emptyRegionCol,.before=2)#mutated_CurrentUSD#head(mutated_CurrentUSD)# replacing the empty char in region with actual region if 2 NAs appear in a row#iterates along the indices of the Country vectorfor(i inseq_along(mutated_CurrentUSD$Country) ){#if 2 nas in a row then do somethingif(is.na(mutated_CurrentUSD$`1949`[i]) &&is.na(mutated_CurrentUSD$`1949`[i+1]) ){ mutated_CurrentUSD$Region[i+1] <- mutated_CurrentUSD$Country[i+1] #this works because when referencing an index outside of the vector R will return NA - Great to know. }#if the row is a regionif(is.na(mutated_CurrentUSD$`1949`[i]) ){ mutated_CurrentUSD$Region[i+1] <- mutated_CurrentUSD$Country[i] }}mutated_CurrentUSD <- mutated_CurrentUSD %>%fill(Region)#mutated_CurrentUSD ```Now that I had my proof of concept for cleaning one of the sheets, in a format I wanted, it was time to clean the other sheets as well. I could have taken the above algorithm and modified it to work for each of the other tabs individually however I took the opportunity to practice with functions and created the below function to clean any of the tabs once they were loaded in and lightly cleaned.# Clean and format function```{r}## creating a function to clean the other sheets##Input: a partially cleaned tibble, must have been read in and had excess rows trimmed##output: tibble with regions correctly labeled and umbrella region categories i.e Africa above south Africa removed##Note: All sub categories contained the main category name in them, will use regex to create groupings in the future.CleanData <-function(inSheet){ emptyRegionCol <-as.numeric(vector(mode ="character",length =nrow(select(inSheet,2))))#adding col to tibble to be populated later output <- inSheet %>%mutate(Region = emptyRegionCol,.before=2)# replacing the empty char in region with actual region if 2 NAs appear in a row#iterates along the indices of the Country vectorfor(i inseq_along(output$Country)){if(is.na(output[[i,3]]) &&is.na(output[[(i+1),3]])) output$Region[i+1] <- output$Country[i+1]if(is.na(output[[i,3]])) output$Region[i+1] <- output$Country[i] } output <- output %>%fill(Region)# removing header rows output <- output %>%filter(!is.na(output[[3]]))}```Below I call the function on each of the tibbles loaded in earlier.```{r}Cleaned_currentUSD <-CleanData(rawData_CurrentUSD)Cleaned_ShareOfGovSpend <-CleanData(rawData_ShareOfGovSpend)Cleaned_ShareOfGDP <-CleanData(rawData_ShareOfGDP)#head(Cleaned_currentUSD)#head(Cleaned_ShareOfGovSpend)#head(Cleaned_ShareOfGDP)```## Pivoting years and Next I pivot the years, it should be noted that both the pivoting and Converting of notation could be handled by the clean function above, i kept the steps separate so that I could show the progression of the dataframe.I must also handle the xxx and ... notations. From the information page of my data set I can see that the notation is described as follows:Raw Notation | Meaning------------- | -------------... | Data unavailablexxx | Country did not exist or was not independent during all or part of the year in questionFor now I think I will just keep both as NA but will save this form of the information for future use.```{r}Cleaned_currentUSD <- Cleaned_currentUSD %>%pivot_longer(cols=3:ncol(Cleaned_currentUSD),names_to ="Year",values_drop_na =FALSE)%>%na_if("...") %>%na_if("xxx")Cleaned_ShareOfGovSpend <- Cleaned_ShareOfGovSpend %>%pivot_longer(cols=3:ncol(Cleaned_ShareOfGovSpend),names_to ="Year",values_drop_na =FALSE)%>%na_if("...") %>%na_if("xxx")Cleaned_ShareOfGDP <- Cleaned_ShareOfGDP %>%pivot_longer(cols=3:ncol(Cleaned_ShareOfGDP),names_to ="Year",values_drop_na =FALSE)%>%na_if("...") %>%na_if("xxx")```### Converting column typesFinally I will convert the column types to their correct representation.Before:```{r}#head(Cleaned_currentUSD)#head(Cleaned_ShareOfGovSpend)#head(Cleaned_ShareOfGDP)# Year to date#Cleaned_currentUSD$Year<-as_date(Cleaned_currentUSD$Year, format = "%Y",tz=NULL)#Cleaned_ShareOfGovSpend$Year<-as_date(Cleaned_ShareOfGovSpend$Year, format = "%Y",tz=NULL)#Cleaned_ShareOfGDP$Year<-as_date(Cleaned_ShareOfGDP$Year, format = "%Y",tz=NULL)#Year to NumberCleaned_currentUSD$Year<-as.integer(Cleaned_currentUSD$Year)Cleaned_ShareOfGovSpend$Year<-as.integer(Cleaned_ShareOfGovSpend$Year)Cleaned_ShareOfGDP$Year<-as.integer(Cleaned_ShareOfGDP$Year)#Cleaned_currentUSD# value to doubleCleaned_currentUSD$value <-as.numeric(Cleaned_currentUSD$value)Cleaned_ShareOfGovSpend$value<-as.numeric(Cleaned_ShareOfGovSpend$value)Cleaned_ShareOfGDP$value<-as.numeric(Cleaned_ShareOfGDP$value)``````{r}#view(Cleaned_ShareOfGovSpend)=Cleaned_currentUSD <- Cleaned_currentUSD %>%rename(MilitarySpend_currentUSD=value)Cleaned_ShareOfGovSpend <- Cleaned_ShareOfGovSpend%>%rename(MilitarySpend_ShareofGovSpend=value)Cleaned_ShareOfGDP <- Cleaned_ShareOfGDP%>%rename(MilitarySpend_ShareofGDP=value)#head(Cleaned_currentUSD)#head(Cleaned_ShareOfGovSpend)#head(Cleaned_ShareOfGDP)```## join of the different tabsI think we will want to do a full join as we dont want to lose any information.```{r}#nrow(Cleaned_currentUSD)#nrow(Cleaned_ShareOfGovSpend)#nrow(Cleaned_ShareOfGDP)MilitarySpend<-full_join(Cleaned_currentUSD,Cleaned_ShareOfGovSpend,by=c("Country","Year","Region"))MilitarySpend <-full_join(MilitarySpend,Cleaned_ShareOfGDP,by=c("Country","Year","Region"))#head(MilitarySpend)#nrow(MilitarySpend)```## HW3### descriptive statsmean median and std for each of the tabsCharts for the statsFirst I wanted to get an idea of what my data looks like for each of the countries individually. In order to see this I grouped on country name and the types of data we have.In order to get that done I need to pivot my data again.```{r}combinedData <-MilitarySpend %>%pivot_longer(c(MilitarySpend_currentUSD,MilitarySpend_ShareofGovSpend,MilitarySpend_ShareofGDP),names_to ="viewOfSpend",values_to ="Spend")head(combinedData)#view(combinedData)na.omit(combinedData)```Below is a graph depicting the completeness of the information available to us.Total number of countries analyzed:```{r}combinedData%>%distinct(Country)%>%count()```Number of countries by region:```{r}combinedData%>%group_by(Region)%>%summarise(numberCountries =n_distinct(Country))%>%arrange(desc(numberCountries))```Due to the number of Regions I may need to focus on some in particular...by year, how many countries had data available```{r}```Statistics by region#MilitarySpend_currentUSD```{r}byRegionStats_USD<-combinedData%>%group_by(Region,viewOfSpend)%>%summarize(mean =mean(Spend, na.rm=TRUE,sigfig=1),std =sd(Spend, na.rm=TRUE),min =min(Spend, na.rm=TRUE),max =max(Spend, na.rm=TRUE))%>%arrange(desc(mean))%>%filter(viewOfSpend=="MilitarySpend_currentUSD")byRegionStats_USD```#MilitarySpend_ShareOfGDP```{r}byRegionStats_GDP<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%summarize(mean =mean(Spend, na.rm=TRUE,sigfig=1),std =sd(Spend, na.rm=TRUE),min =min(Spend, na.rm=TRUE),max =max(Spend, na.rm=TRUE))%>%arrange(desc(mean))%>%filter(viewOfSpend=="MilitarySpend_ShareofGDP")byRegionStats_GDP``````{r}byRegionStats_GDPbyRegionStats_GDP<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%filter(viewOfSpend=="MilitarySpend_ShareofGDP")byRegionStats_GDP$Region<-as.factor(byRegionStats_GDP$Region)violin_gdp<-na.omit(byRegionStats_GDP)%>%ggplot(aes(x=Region,y=Spend))+geom_violin()violin_gdp# removing NAs here because I cant plot them, I dont want to replace them with other values because it is still of interest when countries do not display thier data.```Something crazy going on with the middle east spend...Investigation:```{r}byRegionStats_GDP%>%filter(Region=="Middle East")%>%arrange(desc(Spend))%>%head()```I initially thought this was a typo since there were two 1s in a row and it was such an insanely high spend. A share of GDP of 1.173 means Kuwait spent 117.3% of their GDP in that year. Upon further investigation, this is actually true and it was driven by the Persian Gulf War(aka Gulf War). The Gulf War was a conflict triggered by Iraqs invasion of Kuwait in 1990. This invasion was the first major international crisis since the Cold War and will certainly be a topic I add to the final analysis.https://www.britannica.com/event/Persian-Gulf-WarLoosely I will plot Iraq, US and Kuwait spend from 1985-1995 to explore this time period.```{r}combinedData%>%filter(Year >=1985& Year<=1995)%>%filter(Country %in%c("United States of America","Iraq","Kuwait"))%>%ggplot(mapping=aes(y = Spend, x = Year))+geom_point(aes(color = Country, alpha=0.9),na.rm =FALSE)+facet_wrap(vars(`viewOfSpend`),scales ="free_y")```A new discovery! Iraq chose not to report their spending during the Gulf war.```{r}combinedData%>%filter(Country %in%c("Iraq"))%>%filter(Year>1960&is.na(Spend))startOfNoData<-1982endofNoData<-2003iraqMissingDataPLOT<-combinedData%>%filter(Country %in%c("Iraq"))%>%ggplot(mapping=aes( x = Spend,y = Year))+geom_point()+facet_wrap(vars(`viewOfSpend`),scales ="free_x")+scale_y_continuous(labels=seq(1949,2020,5),breaks =seq(1949,2020,5))iraqMissingDataPLOT+geom_hline(yintercept =1982,color="red")+geom_hline(yintercept =2003,color="red")+annotate("text",x=0,y=1982,label="1982",hjust=-0.1,vjust=-0.1,color="red")+annotate("text",x=0,y=2003,label="2003",hjust=-0.1,vjust=1.1,color="red")```Below I continue to explore the dataset. This might be too much for HW3 however I was doing it anyway for the final so I chose to include it.#MilitarySpend_ShareofGovSpendNow to explore the statistics within the Share of Gov Spend tab.```{r}byRegionStats_GovSpend<-combinedData%>%group_by(Region,viewOfSpend)%>%summarize(mean =mean(Spend, na.rm=TRUE,sigfig=1),std =sd(Spend, na.rm=TRUE),min =min(Spend, na.rm=TRUE),max =max(Spend, na.rm=TRUE))%>%arrange(desc(mean))%>%filter(viewOfSpend=="MilitarySpend_ShareofGovSpend")byRegionStats_GovSpend``````{r}library(forcats)statsViz2<- byRegionStats%>%mutate(test =fct_reorder(Region,desc(mean)))%>%ggplot(mapping=aes(y = test, x = mean))+geom_point(aes(color = Region, alpha=0.9, size=max)) +guides(alpha="none",color="none") +labs(title ="Certain Regions on average spend more than others" ,caption ="Data from SPIRI Military Expediture Database")statsViz2```From this we can see the top 6 spenders reside in a cluster of there own when you look at their mean and max spend.Top 6```{r}top6_mean <- byRegionStats %>%head()%>%pull(Region)top6_mean```## Faceted by Region, show mean spend of each country```{r , 123}byRegionStats_year<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%summarize(mean =mean(Spend, na.rm=TRUE))%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%arrange(desc(mean))byRegionStats_yearbyRegionplot<- byRegionStats_year%>%ggplot(mapping=aes(x = Year, y = mean))+geom_point(aes(color = Region, alpha=0.9,)) +theme_light() +guides(alpha="none") +labs(title ="East asia and NA dominate raw spend" ,caption ="Data from SPIRI Military Expediture Database")byRegionplot```## To be a little less noisy, showing the top 6 spenders over time```{r warning=FALSE,error=TRUE}byRegionStats_year_top6<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%summarize(mean =mean(Spend, na.rm=TRUE))%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%filter(Region == top6_mean)%>%arrange(desc(mean))byRegionplot_top6<- byRegionStats_year_top6%>%ggplot(mapping=aes(x = Year, y = mean))+geom_point(aes(color = Region, alpha=0.9,)) +facet_wrap(vars(`Region`),scales ="free_y") +labs(title ="Raw Mean Spend in USD for top 6 Spenders" ,caption ="Data from SPIRI Military Expediture Database")byRegionplot_top6```From here I choose to investigate Eastern Europe.```{r warning=FALSE,error=TRUE}byRegionStats_year_EasternEurope<-combinedData%>%group_by(Region,viewOfSpend,Year)%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%filter(Region =='Eastern Europe')EasternEuropePlot<- byRegionStats_year_EasternEurope%>%ggplot(mapping=aes(x = Year, y = Spend))+geom_point(aes(color = Country, alpha=0.9)) +facet_wrap(vars(`Country`)) +labs(title ="Raw Mean Spend in Eastern Europe" ,caption ="Data from SPIRI Military Expediture Database")EasternEuropePlot```I left the scaling fixed so that you could see how high USSR spending was and then they become Russia and spend slowley increases but not to the levels of the USSR.We see here that USSR spending looks to be about late 1990s, following some research I found...```{r warning=FALSE,error=TRUE}startOfNoData<-1985endofNoData<-1995USSRSpendPlot<-combinedData%>%filter(Country %in%c("USSR","Russia"))%>%filter(viewOfSpend =='MilitarySpend_currentUSD')%>%ggplot(mapping=aes( x = Year,y = Spend,color =`Country`))+geom_point()USSRSpendPlot+geom_vline(xintercept =1987,color="red")+annotate("text",x=1987,y=0,label="1987",hjust=1,vjust=-0,color="red")+geom_vline(xintercept =1990,color="red")+annotate("text",x=1990,y=0,label="1990",hjust=-0,vjust=-0,color="red")+geom_vline(xintercept =1945,color="black")+annotate("text",x=1945,y=0,label="Start of Cold War, Defeat of Germany and Japan",hjust=-0,vjust=-0,color="black",angle='90')+geom_vline(xintercept =1985,color="black")+annotate("text",x=1985,y=0,label="Mikhail Gorbachev becomes leader of the USSR",hjust=-0.1,vjust=-0,color="black",angle='90')+geom_vline(xintercept =1991,color="black")+annotate("text",x=1991,y=0,label="1991 End of Cold War, USSR disbanded",hjust=-0.1,vjust=1,color="black",angle='90')```From this view of the information it leads one to believe the USSR spending was very high during the cold war, I am unsure why we were given access to their spending towards the end of the cold war, it could have been a show of good faith, more research to come. Once the war was declared over, Russian spending fell to 0 in the following year and slowly started creeping back up.Another area of research i get from this would be what happened in 2003- 2004 to spike the trend in Russian spending.## HW Questions### what questions are left unanswered?Russian spending during 2003-2004The rest of my outlined possible questions from HW2 will still be answered, their answer was just not obvious from the summary statistics.### what may be unclear?In order to improve some of these charts I could include some information about global spending at the time or perhaps augment the information to display it as a portion of global military spend? - If i showed percent of global(or engaged parties) military spend it would be interesting to show if just spending more would win the wars.### How could I improve these vizualizationsI could change the view of some of these visualizations so that they are more appealing. I liked the use of the violin plot for the Gulf War analysis and will use more variety in graphs for the final.## Conclusion> “Every gun that is made, every warship launched, every rocket fired signifies in the final sense, a theft from those who hunger and are not fed, those who are cold and are not clothed. > > This world in arms is not spending money alone. It is spending the sweat of its laborers, the genius of its scientists, the hopes of its children. This is not a way of life at all in any > true sense. Under the clouds of war, it is humanity hanging on a cross of iron.”Dwight D. EisenhowerCitations:https://www.goodreads.com/quotes/tag/military-budget#:~:text=%E2%80%9CEvery%20gun%20that%20is%20made,is%20not%20spending%20money%20alone.https://www.sipri.org/aboutR Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.Wickham, H., & Grolemund, G. (2016). R for data science: Visualize, model, transform, tidy, and import data. OReilly Media.https://www.investopedia.com/terms/g/gdp.aspAppendix:Full descriptions as provided by SIPRI:1. Introduction2. Estimates of world, regional and sub-regional totals in constant (2019) US$ (billions).3. Data for military expenditure by country in current price local currency, presented according to each country's financial year.4. Data for military expenditure by country in current price local currency, presented according to calendar year.5. Data for military expenditure by country in constant price (2019) US$ (millions), presented according to calendar year, and in current US$m. for 2020.6. Data for military expenditure by country in current US$ (millions), presented according to calendar year.7. Data for military expenditure by country as a share of GDP, presented according to calendar year.8. Data for military expenditure per capita, in current US$, presented according to calender year. (1988-2020 only)9. Data for military expenditure as a percentage of general government expenditure. (1988-2020 only)